Native Language Identification Using a Mixture of Character and Word N-grams

Authors

  • Elham Mohammadi
  • Hadi Veisi
  • Hessam Amini
Abstract

Native language identification (NLI) is the task of determining an author’s native language from a piece of their writing in a second language. In recent years, NLI has received much attention due to its challenging nature and its applications in language pedagogy and forensic linguistics. We participated in the NLI Shared Task 2017 under the name UT-DSP. Our method uses a mixture of character and word n-grams and achieved a best F1-score of 0.7748, using both the essay and speech transcription datasets.
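To make the idea of mixing character and word n-grams concrete, the following is a minimal sketch in plain Python. It is not the UT-DSP system: the abstract does not state the n-gram orders, the tokenization, the weighting, or the classifier, so the function names, the default orders (2 and 3 for characters, 1 and 2 for words), whitespace tokenization, and lowercasing are all assumptions chosen for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams (lowercased, whitespace tokens)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mixed_ngram_features(text, char_orders=(2, 3), word_orders=(1, 2)):
    """Count character and word n-grams in a single feature bag.

    The n-gram orders are illustrative assumptions, not values from the paper.
    Keys are prefixed with "c{n}:" or "w{n}:" so that a character n-gram and
    a word n-gram with the same surface string remain distinct features.
    """
    features = Counter()
    lowered = text.lower()
    for n in char_orders:
        features.update(f"c{n}:{gram}" for gram in char_ngrams(lowered, n))
    for n in word_orders:
        features.update(f"w{n}:{gram}" for gram in word_ngrams(text, n))
    return features


if __name__ == "__main__":
    # Hypothetical learner sentence, used only to show the output shape.
    bag = mixed_ngram_features("I am agree with this opinion")
    for name, count in bag.most_common(5):
        print(name, count)
```

A feature bag of this kind could then be turned into a vector and passed to any standard text classifier; which classifier and weighting scheme the authors actually used is not stated in the abstract.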

Similar resources

CIC-FBK Approach to Native Language Identification

We present the CIC-FBK system, which took part in the Native Language Identification (NLI) Shared Task 2017. Our approach combines features commonly used in previous NLI research, i.e., word n-grams, lemma n-grams, part-of-speech n-grams, and function words, with recently introduced character n-grams from misspelled words, and features that are novel in this task, such as typed character n-gram...

Native Language Identification using large scale lexical features

This paper describes an effort to perform Native Language Identification (NLI) using machine learning on a large number of lexical features. The features were collected from sequences and collocations of bare word forms, suffixes and character n-grams, amounting to a feature set of several hundred thousand features. These features were used to train a linear Support Vector Machine (SVM) classifi...

VTEX System Description for the NLI 2013 Shared Task

This paper describes the system developed for the NLI 2013 Shared Task, which requires identifying a writer’s native language from a text written in English. I explore the given manually annotated data using word features such as length, endings and character trigrams. Furthermore, I employ k-NN classification. A modified TF-IDF is used to generate a stop-word list automatically. The distance betw...

A study of N-gram and Embedding Representations for Native Language Identification

We report on our experiments with n-gram and embedding-based feature representations for Native Language Identification (NLI) as a part of the NLI Shared Task 2017 (team name: NLI-ISU). Our best performing system on the test set for written essays had a macro F1 of 0.8264 and was based on word uni-, bi- and trigram features. We explored n-grams covering word, character, POS and word-POS mixed repr...

Native Language Identification using Phonetic Algorithms

In this paper, we discuss the results of the IUCL system in the NLI Shared Task 2017. For our system, we explore a variety of phonetic algorithms to generate features for Native Language Identification. These features are contrasted with one of the most successful types of features in NLI, character n-grams. We find that although phonetic features do not perform as well as character n-grams alon...

Publication date: 2017